Skip to content

feat(rag): add PowerPoint (.pptx) extraction with VLM image support#1224

Merged
kovtcharov-amd merged 7 commits into
mainfrom
kalin/pptx-extraction
Jun 1, 2026
Merged

feat(rag): add PowerPoint (.pptx) extraction with VLM image support#1224
kovtcharov-amd merged 7 commits into
mainfrom
kalin/pptx-extraction

Conversation

@kovtcharov-amd
Copy link
Copy Markdown
Collaborator

Uploading a .pptx told users to "save as PDF first." Now GAIA indexes PowerPoint natively — extracting text from shapes, tables, and speaker notes, with VLM analysis of embedded images when available. The implementation mirrors the existing PDF pipeline (same VLMClient integration, [Page N] markers, merge strategy, metadata structure) so the downstream chunking and retrieval paths work unchanged.

A zip bomb guard checks uncompressed size before opening (500 MB limit), and WMF/EMF metafiles are skipped gracefully since PIL cannot decode them.

Test plan

  • 17 new unit tests pass (text, tables, notes, page markers, empty/blank/corrupt, VLM mocked on/off, dispatcher, utils)
  • 199 existing regression tests pass (PDF extraction, file tools, server upload validation)
  • Lint clean (pylint, flake8, black, isort)
  • Tested on real .pptx file (7 slides, 925 KB) — 1,448 chars extracted in 0.02s
  • CI passes

PPTX files were explicitly excluded from GAIA's RAG pipeline — users had to
save as PDF first. This adds native extraction of text (shapes, tables, speaker
notes) and embedded images (via VLM when available), mirroring the existing
PDF pipeline pattern.

New module pptx_utils.py handles slide-level extraction with group shape
recursion and markdown table formatting. A zip bomb guard rejects files whose
uncompressed size exceeds 500 MB before handing them to python-pptx.
@github-actions github-actions Bot added dependencies Dependency updates rag RAG system changes tests Test changes performance Performance-critical changes labels May 28, 2026
@kovtcharov-amd kovtcharov-amd requested a review from itomek-amd May 28, 2026 20:28
@kovtcharov-amd kovtcharov-amd self-assigned this May 28, 2026
@kovtcharov-amd
Copy link
Copy Markdown
Collaborator Author

@claude review this PR.

@kovtcharov-amd kovtcharov-amd added this to the Knowledge Agent milestone May 29, 2026
@kovtcharov-amd
Copy link
Copy Markdown
Collaborator Author

@claude review this PR

@github-actions
Copy link
Copy Markdown
Contributor

Solid PR — the per-slide loop, VLM integration, [Page N] markers, and merge strategy faithfully mirror _extract_text_from_pdf, the tests are hermetic (fixtures built with python-pptx, no committed binaries), and the zip-bomb guard is a nice safety addition the PDF path doesn't even have. A few things worth a look before merge:

🟡 Error handling diverges from the PDF pattern it mirrors. PDF failures raise typed exceptions (CorruptedPDFError/EmptyPDFError, subclasses of PDFExtractionError) which the indexing wrapper catches at src/gaia/rag/sdk.py:2678 to set stats["pdf_status"] = e.status — a status code UIs use to badge the document. The PPTX path raises plain ValueError, so corrupt/empty .pptx files fall through to the generic except Exception at sdk.py:2688: the user still gets the actionable message, but no status badge. Relatedly, metadata["pptx_status"] (sdk.py:1118) is computed but never propagated into the index stats, so it's effectively dead. Consider a small PPTXExtractionError(PDFExtractionError) (or a shared base) + wiring pptx_status into stats, so PPTX and PDF behave identically downstream.

🟢 Markdown table rows are single-line. _table_to_markdown (pptx_utils.py:276) joins each row on one line, but a PPTX cell can hold multi-paragraph text — a cell containing a newline would break the row's column alignment. A .replace("\n", "<br>") (or space) alongside the existing |-escaping would harden it. Minor, since most slide tables are single-line cells.

🟢 Two except …: pass blocks (sdk.py:457 count-images, sdk.py:510 cleanup). These mirror the PDF path so they're not new precedent, but per CLAUDE.md's "No Silent Fallbacks" the count-images one silently drops images on any error — a self.log.debug(...) would make a failure diagnosable rather than invisible.

One process note: this adds a new document type feeding RAG retrieval. Unit tests cover extraction mechanics well, but if you haven't already, a quick gaia eval agent --category rag_quality on a PPTX-backed scenario would confirm the extracted text actually retrieves well end-to-end (CLAUDE.md flags RAG-affecting paths for eval).

None of these are blockers — the core extraction logic looks correct and well-tested. The error-typing one is the only thing I'd genuinely recommend addressing for parity with PDF.

@github-actions
Copy link
Copy Markdown
Contributor

Solid implementation that cleanly mirrors the PDF pipeline — the zip bomb guard, VLM integration, and [Page N] marker approach all land correctly. One structural bug needs fixing before merge.


Issues

🟡 pptx_utils functions imported inside the wrong try block (sdk.py)

extract_text_from_slide and extract_notes_from_slide are imported inside the VLM-init try/except block, but they are called unconditionally in the slide loop — regardless of whether VLM is available. If from gaia.llm import VLMClient raises any exception, the slide loop crashes with NameError instead of gracefully falling back to text-only mode.

Compare this with _extract_text_from_pdf (sdk.py:668-673), where count_images_in_page and extract_images_from_page_pymupdf are also in the VLM block — but there they're only used inside if vlm_available:, so the bundling is safe. In the PPTX path that invariant is broken.

The fix is to move the unconditionally-needed imports out:

            from gaia.rag.pptx_utils import (  # pylint: disable=import-outside-toplevel
                count_images_in_slide,
                extract_images_from_slide,
                extract_notes_from_slide,
                extract_text_from_slide,
            )

            # Initialize VLM client (auto-enabled if available)
            vlm = None
            vlm_available = False
            try:
                from gaia.llm import (  # pylint: disable=import-outside-toplevel
                    VLMClient,
                )

                vlm = VLMClient(
                    vlm_model=self.config.vlm_model, base_url=self.config.base_url
                )
                vlm_available = vlm.check_availability()

🟡 Docs not updated for PPTX support

docs/guides/chat.mdx:105 still reads "chatting with PDF documents" and docs/sdk/sdks/rag.mdx describes the pipeline as PDF-only throughout. CLAUDE.md requires every new feature to be documented. A one-line update to the supported-format list in each file is sufficient.


🟢 Silent except Exception: pass on count_images_in_slide (sdk.py)

This violates CLAUDE.md's "No Silent Fallbacks" rule. After the exception has_imgs remains False, so the fallback is correct, but the suppression is invisible. One logger.debug() would satisfy the rule.

                    except Exception as e:  # pylint: disable=broad-except
                        self.log.debug("count_images_in_slide failed on slide %d: %s", i, e)

(Same fix applies to the VLM cleanup() call at sdk.py:1063-1064 — note this is pre-existing in the PDF path too, so fixing it here sets the right precedent.)

🟢 slide_num is dead weight in extract_text_from_slide (pptx_utils.py:194)

The # pylint: disable=unused-argument annotation is a code smell. Either add one logger.debug("Extracting text from slide %d", slide_num) to justify keeping it, or drop the parameter (the caller can log).

def extract_text_from_slide(slide, slide_num: int) -> str:

→ add logger.debug("Extracting text from slide %d", slide_num) at the top of the function body, and remove the pylint suppression.

🟢 from typing import List, Tuple (pptx_utils.py:10)

GAIA targets Python 3.10+. Use the built-in generics.

from __future__ import annotations

(add at the top, then replace List[dict]list[dict] and Tuple[bool, int]tuple[bool, int])

🟢 _iter_shapes group-recursion path not tested

The depth-limit guard in _iter_shapes is the only non-trivial logic in pptx_utils.py without a test. A slide with a nested group shape (and one at depth ≥ 5) would cover it; the fixture helper in test_pptx_extraction.py could add this cheaply.


Strengths

  • Zip bomb guard (sdk.py) is a thoughtful, proactive security measure — checks total uncompressed size before handing the file to python-pptx, with an actionable user-facing error.
  • Tight parity with the PDF pipeline: [Page N] markers, VLM merge strategy, _extract_text_from_file dispatcher, and metadata keys all slot in without touching any existing callers.
  • Test suite is genuinely useful: programmatic fixture construction (no binary PPTX files committed), VLM mocked on/off, dispatcher test, and direct utils coverage. The 199 regression tests passing confirms no regressions in the PDF / file-tools paths.

Verdict

Request changes — one structural fix required before merge (imports in the wrong try block). The docs gap and minor nits can be addressed here or as a follow-up, but the import ordering needs to land with this PR to prevent a latent NameError on VLM-unavailable environments.

@github-actions github-actions Bot added the documentation Documentation changes label May 29, 2026
@github-actions
Copy link
Copy Markdown
Contributor

Solid, well-scoped feature. The PPTX pipeline faithfully mirrors _extract_text_from_pdf — same VLMClient integration, [Page N] markers, merge strategy, and metadata shape — so the downstream chunking and retrieval paths work without modification. The zip bomb guard is a nice addition that the PDF path doesn't have. Test coverage is thorough (17 cases, programmatic fixtures, no committed binaries).

Two minor issues worth fixing before merge, then a handful of nits.


Issues Found

🟡 import zipfile is inside the function body — move to module level (src/gaia/rag/sdk.py:886)

zipfile is a stdlib module that's unconditionally required by _extract_text_from_pptx. Burying it inside the function body was the right move for from pptx import Presentation (optional-dep guard) but not for zipfile. It belongs at the top of the file alongside the other stdlib imports.

import zipfile

Add it to the existing stdlib block at the top of sdk.py (after line 17 import time) and remove the inline import zipfile at line 886.


🟡 pptx_status is computed but never surfaced in index_document stats (src/gaia/rag/sdk.py:2677)

index_document sets stats["pdf_status"] = "readable" for successful PDFs (line 2677) and captures specific status codes via PDFExtractionError on failure (line 2689). PPTX has no equivalent — successful PPTX indexing leaves the stats dict without a pptx_status key, and PPTX errors fall through to the generic handler with only stats["error"], no status code. The metadata returned by _extract_text_from_pptx already computes pptx_status: "readable", so this is a one-liner fix:

            if file_type == ".pdf":
                stats["pdf_status"] = "readable"
            elif file_type == ".pptx":
                stats["pptx_status"] = "readable"

Any UI or caller that reads stats["pptx_status"] for display/badging (similar to the encrypted/corrupted PDF badges) would KeyError without this.


🟢 from typing import List, Tuple — project requires Python 3.10+ (src/gaia/rag/pptx_utils.py:104)

setup.py sets python_requires=">=3.10", so the built-in generics work. pdf_utils.py also uses the legacy form (pre-existing), but since this is a new file it's a clean opportunity to use the modern style.

from __future__ import annotations

Then replace List[dict]list[dict] and Tuple[bool, int]tuple[bool, int] throughout the file. The from __future__ import annotations at the top of the file covers the return-type annotations.


🟢 count_images_in_slide is imported inside the VLM try/except block (src/gaia/rag/sdk.py:951)

count_images_in_slide and extract_images_from_slide are pptx_utils helpers that have no VLM dependency — they extract PIL images from slides. Bundling their import with the VLMClient import means they're silently unavailable if VLM init throws, even though the import itself would succeed. Moving them to the earlier from gaia.rag.pptx_utils import extract_notes_from_slide, extract_text_from_slide block (line 939) is cleaner.


🟢 txBox in tests violates PEP 8 naming (tests/unit/rag/test_pptx_extraction.py:1089, 1110)

        tx_box = slide.shapes.add_textbox(Inches(3), Inches(1), Inches(3), Inches(1))
        tx_box.text_frame.text = "Slide with image and text"

(Same at line 1110 in test_vlm_processes_images.)


Strengths

  • Zip bomb guard — checking uncompressed size before handing the file to python-pptx is a genuine security improvement the PDF path doesn't have. The 500 MB limit and actionable error message are well-calibrated.
  • Test fixtures — building PPTX programmatically with python-pptx keeps the test suite hermetic (no committed binary fixtures) while covering text, tables, notes, page markers, empty/blank/corrupt, VLM mocked on/off, and dispatcher routing. The test_vlm_processes_images patch dance to override the fixture's default VLM mock is a clean solution.
  • Graceful WMF/EMF handling — skipping metafiles with a per-image warning rather than failing the whole slide is the right tradeoff; PIL genuinely can't decode them.

Verdict

Approve with suggestions. The two 🟡 items are small but worth landing before merge — the import zipfile is a trivial move, and the missing pptx_status in stats will surface as a KeyError the moment any UI code tries to badge PPTX documents the same way it badges PDFs.

@github-actions
Copy link
Copy Markdown
Contributor

Clean, well-scoped addition that makes PPTX indexing work the way users expect — the existing PDF pipeline carries over naturally (same VLMClient integration, [Page N] markers, merge strategy, metadata shape). One test fragility issue needs fixing before merge; everything else is advisory.


Issues

🟡 extract_text_from_slide carries an unused parameter (pptx_utils.py:167)

slide_num is declared, documented as "for logging", then suppressed with # pylint: disable=unused-argument. Either drop the parameter (the caller doesn't need to pass it) or add a logger.debug("Extracting text from slide %d", slide_num) so the argument earns its place. The pylint disable comment signals the inconsistency.

def extract_text_from_slide(slide) -> str:

… and remove slide_num from every call site in sdk.py.


🟡 test_vlm_processes_images leaves the patch in a bad state on failure (test_pptx_extraction.py:346)

rag._test_vlm_patch.stop()
with patch("gaia.llm.VLMClient", return_value=fake_vlm):
    text, _, metadata = rag._extract_text_from_pptx(str(pptx_path))

# Re-start the original patch for fixture cleanup
rag._test_vlm_patch.start()   # ← skipped if assertion inside `with` raises

If anything inside the with block raises (assertion error, unexpected exception), the start() call is bypassed. The fixture teardown then calls .stop() on an already-stopped patcher, raising RuntimeError and masking the original failure. Use a finally block or restructure to avoid the manual stop/start entirely:

    rag._test_vlm_patch.stop()
    try:
        with patch("gaia.llm.VLMClient", return_value=fake_vlm):
            text, _, metadata = rag._extract_text_from_pptx(str(pptx_path))
    finally:
        rag._test_vlm_patch.start()

🟡 Zip bomb guard is untested (sdk.py:399–416)

The 500 MB uncompressed-size check is the one security-relevant path in this PR and has zero test coverage. A synthetic zip that reports a large file_size in its central directory without actually containing that data is straightforward to construct in a test.

def test_zip_bomb_guard_raises(self, rag, tmp_path):
    """PPTX with a single entry claiming 600 MB uncompressed raises ValueError."""
    import io, struct, zipfile as zf

    pptx_path = tmp_path / "bomb.pptx"
    buf = io.BytesIO()
    with zf.ZipFile(buf, "w", zf.ZIP_STORED) as z:
        # Write a tiny payload but lie about uncompressed size in the header
        info = zf.ZipInfo("slides/slide1.xml")
        info.file_size = 600 * 1024 * 1024  # 600 MB — exceeds 500 MB limit
        info.compress_size = 10
        # Write raw: python-zipfile won't let us fake file_size easily,
        # so just verify that a real 600 MB file triggers the guard.
        pass  # Replace with a real crafted zip approach, or skip if too brittle

    with pytest.raises(ValueError, match="too large after decompression"):
        rag._extract_text_from_pptx(str(pptx_path))

At minimum, document this in the test-plan comment so reviewers know coverage is intentionally deferred.


🟢 import zipfile should be top-level (sdk.py:~886)

time, os, re, and every other stdlib module used by this file are imported at the top. zipfile sneaks in mid-function alongside import time as time_module (a pre-existing pattern not worth propagating). Move it to the module-level block.

import zipfile

(add after the existing import time on line 17)

Then remove the inline import zipfile from _extract_text_from_pptx.


🟢 _iter_shapes silently drops shapes at max depth (pptx_utils.py:326–331)

When depth >= MAX_GROUP_DEPTH, the group's children are silently skipped. For a pathological presentation this means content is lost with no indication. A logger.warning here would at least surface the issue in logs:

        else:
            logger.warning(
                "Group shape nesting exceeds MAX_GROUP_DEPTH (%d); skipping children",
                MAX_GROUP_DEPTH,
            )

Strengths

  • Zip bomb guard (sdk.py:399–416) is a proactive and rare-to-see safety check for ZIP-container formats. The 500 MB uncompressed limit and the actionable error message ("zip bomb or very large embedded media") are both exactly right.
  • Programmatic test fixtures in test_pptx_extraction.py avoid committing binary .pptx blobs while still exercising the full extraction stack — text, tables, speaker notes, images, VLM on/off, dispatcher routing. The 17-test suite covers the expected failure modes thoroughly.
  • _table_to_markdown correctly escapes | inside cell text (replace("|", "\\|")), a subtle correctness detail that most first-pass implementations miss.

Verdict

Approve with suggestions — the test_vlm_processes_images patch fragility is the one item that should land before merge (it can silently mask test failures on CI); the rest are advisory. All other paths are solid.

Move `import zipfile` to module-level, surface `pptx_status` in
index_document stats, harden test VLM patch with try/finally, use
modern type hints, add logger.debug for slide_num, warn on group
shape depth limit, and fix txBox → tx_box naming.
@github-actions
Copy link
Copy Markdown
Contributor

Solid feature addition with a clean mirror of the existing PDF pipeline. One docstring claim about pptx_status doesn't match the code, and a couple of import-organisation nits are worth tidying; nothing blocking.


Summary

The PPTX path faithfully replicates the PDF pipeline — same [Page N] markers, VLM integration, merge strategy, and metadata shape — so downstream chunking and retrieval require no changes. The zip bomb pre-check and comprehensive 17-test suite are highlights. Two minor correctness gaps and one import-organisation nit are called out below.


Issues Found

🟡 pptx_status "corrupted" value is unreachable (src/gaia/rag/sdk.py:869)

The docstring for _extract_text_from_pptx promises pptx_status can be "readable" or "corrupted", but the metadata dict is only built on the success path and always set to "readable". Corrupted/empty files raise ValueError, which bypasses the metadata construction entirely. At the index_document level, stats["pptx_status"] = "readable" (line 2682) is likewise only reached on success; the except Exception fallback (line 2695) sets stats["error"] but never stats["pptx_status"]. Contrast this with the PDF path where typed PDFExtractionError subclasses carry a .status code that the dedicated except PDFExtractionError block surfaces in stats["pdf_status"].

No UI currently reads pptx_status, so this isn't breaking today, but the docstring is misleading and the pattern diverges from the PDF handler. The minimal fix is to correct the docstring; the complete fix also adds typed PPTX error classes mirroring CorruptedPDFError / EmptyPDFError:

            - pptx_status: str (always ``"readable"`` — only populated on success)

(at sdk.py:869)

And at the index_document error handler (after line 2694), mirror the PDF pattern:

        except ValueError as e:
            # Covers CorruptedPPTXError / EmptyPPTXError once typed classes are added,
            # or plain ValueError from the current implementation.
            stats["error"] = str(e)
            stats["pptx_status"] = "corrupted"
            return stats

🟢 count_images_in_slide / extract_images_from_slide imported inside the VLM try block (src/gaia/rag/sdk.py:951–953)

These two pptx_utils functions have no VLM dependency — they just inspect shapes. Importing them inside the try: VLMClient(...) block means if VLM init raises (e.g., ImportError for gaia.llm), they silently become undefined. The code is safe because vlm_available=False guards all call sites, but the coupling is invisible.

            from gaia.rag.pptx_utils import (  # pylint: disable=import-outside-toplevel
                count_images_in_slide,
                extract_images_from_slide,
                extract_notes_from_slide,
                extract_text_from_slide,
            )

            # Initialize VLM client (auto-enabled if available)
            vlm = None
            vlm_available = False

(consolidate both pptx_utils imports into one, before the try: VLMClient(...) block)


🟢 "Shared constants" comment is misleading (src/gaia/rag/pptx_utils.py:109)

MAX_DIMENSION, MAX_SIZE_KB, MAX_ITERATIONS are module-level copies of values defined inline inside pdf_utils.py — they are parallel, not actually shared via a common import. The comment sets a false expectation that changing one will change the other.

# Image processing constants — keep in sync with pdf_utils.py
MAX_DIMENSION = 1600
MAX_SIZE_KB = 300
MAX_ITERATIONS = 5
MAX_GROUP_DEPTH = 5

Strengths

  • Zip bomb guard (sdk.py:889–917) — pre-checking total uncompressed size via zipfile.ZipFile before handing the file to python-pptx is a thoughtful, low-cost security measure that the PDF path lacks.
  • Hermetic test suite — all 17 tests build .pptx fixtures programmatically with python-pptx rather than committing binary blobs; the VLM mock toggle pattern (_test_vlm_patch.stop() / start() in test_vlm_processes_images) cleanly isolates the happy path without polluting other tests.
  • Allowlist consistencyutils.py, files.py, and test_server.py are all updated in lock-step; the rejection message in files.py:724–726 distinguishes legacy .ppt from modern .pptx clearly.

Verdict

Approve with suggestions — the pptx_status docstring fix is worth a quick edit before merge; the import consolidation is a nit. Core implementation is correct and well-tested.

…y extraction

When PowerPoint is installed (Windows), automatically converts PPTX to PDF
and feeds it through the existing PDF pipeline -- capturing charts, SmartArt,
and visual layout that python-pptx cannot access. Speaker notes are still
extracted via python-pptx since they do not appear in PDF renders.

Falls back gracefully to python-pptx native extraction when PowerPoint is
not available (Linux, Mac, or no Office installed).
@github-actions
Copy link
Copy Markdown
Contributor

PPTX extraction lands cleanly with solid test coverage and a thoughtful fallback chain. Two issues need to be addressed before merge: a temp-directory leak on every non-Windows PPTX indexing call, and a path injection risk in the PowerShell COM script.


Summary

The implementation correctly mirrors the PDF pipeline — [Page N] markers, VLM integration, metadata structure, and dispatcher routing all transfer naturally. The zip bomb guard, programmatically-generated test fixtures, and the COM→python-pptx fallback logic are all well-designed. The two blocking issues below affect correctness and security and should be fixed before merge.


Issues

🟡 Important — Temp directory leaked when conversion returns None (sdk.py:950–960)

mkdtemp is called unconditionally, but cleanup lives inside if pdf_conversion_path: … finally:. On every non-Windows platform (and on Windows without PowerPoint), convert_pptx_to_pdf returns None with no exception — so pdf_conversion_path is falsy, the if block is skipped, and tmp_dir is never removed. Every PPTX indexing call on macOS/Linux leaks a directory.

            pdf_conversion_path = None
            tmp_dir = None
            try:
                import tempfile as _tempfile

                tmp_dir = _tempfile.mkdtemp(prefix="gaia_pptx_")
                pdf_conversion_path = convert_pptx_to_pdf(
                    str(Path(pptx_path).resolve()), tmp_dir
                )
            except Exception as conv_err:
                self.log.debug("PPTX→PDF conversion not available: %s", conv_err)

            if pdf_conversion_path:
                try:

And add a matching else branch after the if pdf_conversion_path: block's finally:

            else:
                if tmp_dir:
                    import shutil
                    shutil.rmtree(tmp_dir, ignore_errors=True)

🔒 SECURITY CONCERN — Path injection in PowerShell script (pptx_utils.py:407,412)

@kovtcharov-amdpptx_abs is interpolated directly into a PowerShell single-quoted string with no escaping. A file whose path contains a ' character (valid on Windows — e.g., C:\Users\O'Brien\slides.pptx) breaks out of the string literal and enables arbitrary PowerShell injection. PowerShell escapes literal single quotes by doubling them ('').

    # Escape single quotes for PowerShell single-quoted strings.
    pptx_abs_ps = pptx_abs.replace("'", "''")
    pdf_abs_ps = pdf_abs.replace("'", "''")

    ps_script = (
        "$ErrorActionPreference = 'Stop'; "
        "$ppt = New-Object -ComObject PowerPoint.Application; "
        "try { "
        f"  $pres = $ppt.Presentations.Open('{pptx_abs_ps}', "
        "    [int]-1, "  # ReadOnly = msoTrue
        "    [int]0, "  # Untitled = msoFalse
        "    [int]0"  # WithWindow = msoFalse
        "  ); "
        f"  $pres.SaveAs('{pdf_abs_ps}', 32); "  # 32 = ppSaveAsPDF
        "  $pres.Close(); "
        "} finally { "
        "  $ppt.Quit(); "
        "}"
    )

🟢 Minor — import shutil inside finally block (sdk.py:1005)

shutil is stdlib and always available; importing it inside a finally block is unusual and slightly risky if something interferes with import machinery during cleanup. Move it to the top of the method alongside the other lazy imports, or hoist it above the if pdf_conversion_path: block.

            if pdf_conversion_path:
                import shutil
                try:

🟢 Minor — Redundant import time as time_module (sdk.py:871)

time is already imported at module scope (line 17). The # pylint: disable=reimported suppressor acknowledges the conflict. Since _extract_text_from_pdf (which this mirrors) has the same pattern, this is a pre-existing convention — but it's worth cleaning both up to just use time.time() directly when the opportunity arises.


Strengths

  • Test fixtures are hermetic — building .pptx files programmatically rather than committing binaries means tests stay reproducible and diff-able.
  • Zip bomb guard is well-targeted: checks uncompressed size before python-pptx opens the file, with a user-actionable error message.
  • Fallback chain design — COM fast path only attempted, finally correctly cleans up when the fast path is taken, and the python-pptx fallback is reached on exception without masking the original error.
  • [Page N] marker preservation in both the native extraction and the COM/PDF path means downstream chunking and citation attribution work unchanged.

Verdict

Request changes — the temp directory leak (🟡) affects all non-Windows users on every PPTX index call, and the PowerShell injection (🔒) is a real vector on Windows with unusual-but-valid filenames. Both fixes are small and mechanical; once applied this is ready to merge.

Builds a test PPTX programmatically (no sensitive files), indexes it through
the full pipeline (COM conversion, PDF extraction, VLM, embeddings), then
verifies LLM answers contain expected facts (cost, location, team members,
speaker notes). Skips when Lemonade is not running.
@github-actions
Copy link
Copy Markdown
Contributor

Clean feature addition that mirrors the PDF pipeline precisely — same [Page N] markers, VLMClient integration, metadata keys, and merge strategy mean no downstream changes required. Two issues need to be addressed before merge.


Issues Found

🔴 Critical — Path injection in PowerShell script (pptx_utils.py:317,322)

🔒 SECURITY CONCERN @kovtcharov-amdpptx_abs and pdf_abs are interpolated directly into single-quoted PowerShell string literals. A Windows path containing an apostrophe (e.g. C:\Users\O'Connor\slides.pptx) terminates the string literal early. Because pptx_path is user-controlled (CLI --index or an uploaded filename), a crafted path can inject arbitrary PowerShell. The comment on line 309 notes single quotes "handle spaces correctly" but does not account for apostrophes in the path itself.

Please discuss privately — details should not be in the public PR thread.

    pptx_abs = str(Path(pptx_path).resolve()).replace("'", "''")
    pdf_name = Path(pptx_path).stem + ".pdf"
    pdf_abs = str(Path(output_dir).resolve() / pdf_name).replace("'", "''")

Doubling ' to '' is the standard PowerShell single-quoted-string escape.


🟡 Important — Temp directory leaked on every non-Windows call (sdk.py fast-path block)

mkdtemp(prefix="gaia_pptx_") runs unconditionally, but shutil.rmtree only executes inside the if pdf_conversion_path: branch's finally. On Linux/macOS, convert_pptx_to_pdf returns None immediately, so the empty directory leaks on every PPTX indexing call. Add cleanup in the else branch:

            if pdf_conversion_path:
                try:
                    if self.config.show_stats:
                        print(
                            "  📊 Using PowerPoint→PDF conversion for full-fidelity extraction"
                        )
                    self.log.info("Using PowerPoint→PDF conversion for %s", file_name)
                    pdf_text, pdf_pages, pdf_metadata = self._extract_text_from_pdf(
                        pdf_conversion_path
                    )

                    # Append speaker notes (not in PDF render)
                    notes_parts = []
                    for i, slide in enumerate(prs.slides, 1):
                        notes = extract_notes_from_slide(slide)
                        if notes:
                            notes_parts.append(
                                f"\n[Page {i}] **Speaker Notes:**\n{notes}"
                            )
                    if notes_parts:
                        pdf_text += "\n\n" + "\n".join(notes_parts)

                    metadata = {
                        "num_slides": pdf_pages,
                        "vlm_slides": pdf_metadata.get("vlm_pages", 0),
                        "total_images": pdf_metadata.get("total_images", 0),
                        "vlm_checked": pdf_metadata.get("vlm_checked", False),
                        "vlm_available": pdf_metadata.get("vlm_available", False),
                        "pptx_status": "readable",
                        "conversion": "powerpoint_com",
                    }

                    extract_duration = time_module.time() - extract_start
                    self.log.info(
                        f"📝 Extracted {len(pdf_text):,} characters via PDF conversion in {extract_duration:.2f}s"
                    )
                    return pdf_text, pdf_pages, metadata
                except Exception as pdf_err:
                    self.log.warning(
                        "PDF extraction from converted PPTX failed, falling back: %s",
                        pdf_err,
                    )
                finally:
                    import shutil

                    if tmp_dir:
                        shutil.rmtree(tmp_dir, ignore_errors=True)
            elif tmp_dir:
                import shutil

                shutil.rmtree(tmp_dir, ignore_errors=True)

🟢 Minor — Bare except Exception in extract_notes_from_slide (pptx_utils.py:219)

Speaker notes are non-critical, so swallowing errors here is understandable, but GAIA convention asks for a narrower catch. AttributeError and TypeError cover the real failure modes (missing notes_text_frame, None text):

    except (AttributeError, TypeError) as e:
        logger.debug("Could not extract notes from slide: %s", e)

Strengths

  • ZIP bomb guard (sdk.pyzipfile.infolist() pre-check) is a genuinely useful safety addition that mirrors nothing in the PDF path; good initiative.
  • Test completeness: 17 hermetic unit tests (programmatic fixtures, no committed binaries) covering text/tables/notes, page markers, VLM mocked on/off, dispatcher routing, empty/blank/corrupt edge cases, and both conversion fast-path branches. The integration test verifies the full RAG → LLM pipeline against real Lemonade with known-content Q&A.
  • Exact parity with PDF pipeline[Page N] markers, _merge_page_texts, VLMClient.extract_from_page_images, and metadata keys all unchanged, so chunking and retrieval require zero adaptation.

Verdict

Request changes — the PowerShell injection (🔴) must be fixed, and the temp-dir leak (🟡) should be resolved in the same commit. The minor notes-exception narrowing can be included or left for a follow-up.

Copy link
Copy Markdown
Collaborator

@itomek itomek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The extraction logic is in good shape and you've already cleared most of the earlier review notes (imports reordered so the text/notes helpers load before the VLM try-block, count_images_in_slide failure now logs, zipfile hoisted to module scope, _iter_shapes depth-skip now warns, slide_num now used). Faithful PDF-pipeline parity, dep declared (python-pptx in both rag extras), docs (chat.mdx/rag.mdx) and the allowlist (files.py/utils.py) all updated consistently, plus 17 unit tests. Approving.

Two items worth fixing here or filing a follow-up for, noted inline — neither leaves the machine, so not blocking, but both are real:

  • Temp-dir leak on the non-Windows path: convert_pptx_to_pdf returns None on Linux/macOS, so the unconditional mkdtemp is never cleaned up — an empty gaia_pptx_* dir leaks on every PPTX index. One-line elif fix.
  • PowerShell path interpolation in convert_pptx_to_pdf breaks on apostrophe paths; escape single quotes before interpolation.
    Plus a multi-line-cell markdown hardening and a missing test for the zip-bomb guard.

One process note per CLAUDE.md: this feeds a new document type into RAG retrieval, so a quick gaia eval agent --category rag_quality against a PPTX-backed scenario would confirm the extracted text actually retrieves well end-to-end, not just that extraction runs.


Generated by Claude Code

Comment thread src/gaia/rag/sdk.py
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Temp-dir leak on the non-Windows path: tmp_dir = mkdtemp(...) runs unconditionally, but the only shutil.rmtree is inside the if pdf_conversion_path: ... finally. On Linux/macOS convert_pptx_to_pdf returns None, so that branch is skipped and the empty gaia_pptx_* dir leaks on every PPTX index. Add an elif tmp_dir: shutil.rmtree(tmp_dir, ignore_errors=True) after the conversion block. (Worth fixing here or filing an issue — it recurs on the primary path.)


Generated by Claude Code

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert_pptx_to_pdf interpolates pptx_abs/pdf_abs into single-quoted PowerShell string literals with no escaping. A path containing an apostrophe (valid on Windows, e.g. C:\Users\O'Brien\deck.pptx) closes the literal early; since the path is user-controlled this breaks conversion. Standard fix is doubling single quotes — pptx_abs.replace("'", "''") (and same for pdf_abs) — before interpolation. _table_to_markdown is also worth hardening for multi-line cell text (.replace("\n", "
")), and the zip-bomb guard has no test yet.


Generated by Claude Code

@kovtcharov-amd kovtcharov-amd merged commit 1f2f7ad into main Jun 1, 2026
55 of 56 checks passed
@kovtcharov-amd kovtcharov-amd deleted the kalin/pptx-extraction branch June 1, 2026 16:20
@kovtcharov-amd kovtcharov-amd mentioned this pull request Jun 1, 2026
6 tasks
itomek added a commit that referenced this pull request Jun 2, 2026
- featureTitle now lists all blocked Office formats (doc, docx, ppt, xls)
  so the auto-filed feature-request title matches the blocklist
- drop the #1224 reference from a test name so the assertion stands alone
pull Bot pushed a commit to bhardwajRahul/gaia that referenced this pull request Jun 3, 2026
The Agent UI hard-blocked PowerPoint uploads even though `.pptx` RAG
shipped end-to-end in amd#1224 — the RAG backend *and* the FastAPI
`ALLOWED_EXTENSIONS` already accept `.pptx`, but the React
`UnsupportedFeature` gate still classified it as an unsupported
"Microsoft Office" type and rejected it *before* upload. A headline
feature was unreachable in the UI, and users were filing feature
requests (amd#1072, amd#1291) for something that already worked. This drops
`.pptx` from the frontend blocklist, adds it to the supported set (now
matching the server allowlist exactly), and fixes the now-false
rejection copy. `.doc/.docx/.ppt/.xls` stay blocked. No server change —
the backend already accepts `.pptx`.

Closes amd#1363

## Test plan
- [x] `cd src/gaia/apps/webui && npm test` — Vitest 18/18 pass locally
(new assertions verify `.pptx` is accepted and `.doc/.ppt/.xls` stay
blocked). Note: the `test-webui-vitest` CI job doesn't currently trigger
on diffs confined to `src/gaia/apps/webui` (pre-existing path-filter gap
in `test_electron.yml`), so this was verified locally, not in CI.
- [x] `cd src/gaia/apps/webui && npm run build` — `tsc && vite build`
clean
- [x] `code-reviewer` pass — no Critical/Advisory findings;
independently confirmed the frontend `SUPPORTED_EXTENSIONS` now matches
the server-side `ALLOWED_EXTENSIONS` and no consumer (`DocumentLibrary`,
`FileBrowser`) breaks
- [x] Real-world (Agent UI on an AMD GPU box, Gemma-4-E4B): uploaded a
`.pptx` through the Document Library → **accepted** (no "Microsoft
Office not supported" toast) → indexed (1 chunk via nomic-embed) → Doc
Agent answered questions grounded in slide body text (92.7% success
rate), a table cell (14.5 Nm torque), and speaker notes (Bluejay-44
codename)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

dependencies Dependency updates documentation Documentation changes performance Performance-critical changes rag RAG system changes tests Test changes

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants